A Context-Aware Topic Model for Statistical Machine Translation

نویسندگان

  • Jinsong Su
  • Deyi Xiong
  • Yang Liu
  • Xianpei Han
  • Hongyu Lin
  • Junfeng Yao
  • Min Zhang
چکیده

Lexical selection is crucial for statistical machine translation. Previous studies separately exploit sentence-level contexts and documentlevel topics for lexical selection, neglecting their correlations. In this paper, we propose a context-aware topic model for lexical selection, which not only models local contexts and global topics but also captures their correlations. The model uses target-side translations as hidden variables to connect document topics and source-side local contextual words. In order to learn hidden variables and distributions from data, we introduce a Gibbs sampling algorithm for statistical estimation and inference. A new translation probability based on distributions learned by the model is integrated into a translation system for lexical selection. Experiment results on NIST ChineseEnglish test sets demonstrate that 1) our model significantly outperforms previous lexical selection methods and 2) modeling correlations between local words and global topics can further improve translation quality.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Topic-Triggered Language Model for Statistical Machine Translation

Language model is an essential part in statistical machine translation, but traditional n-gram language models can only capture a limited local context in the translated sentence, thus lacking the global information for prediction. This paper describes a novel topic-triggered language model, which takes into account the topical context by estimating the n-gram probability under the given topics...

متن کامل

A new model for persian multi-part words edition based on statistical machine translation

Multi-part words in English language are hyphenated and hyphen is used to separate different parts. Persian language consists of multi-part words as well. Based on Persian morphology, half-space character is needed to separate parts of multi-part words where in many cases people incorrectly use space character instead of half-space character. This common incorrectly use of space leads to some s...

متن کامل

Using Features from Topic Models to Alleviate Over-Generation in Hierarchical Phrase-Based Translation

In hierarchical phrase-based translation systems, the grammars (SCFG rules) have over-generation problem because we can replace the non-terminalX with almost everything without knowing the syntactic or semantic role ofX . In this paper, we present an approach that uses topic models to learn the distributions for non-terminals in each SCFG rule, based on which we further derive static features f...

متن کامل

Virtual Babel: Towards Context-Aware Machine Translation in Virtual Worlds

In this paper, we describe our ongoing research project of Virtual Babel, a contextaware machine translation system for Second Life, one of the most popular virtual worlds. We augment the Second Life viewer to intercept the incoming/outgoing chat messages and reroute the message to a statistical machine translation server. The returned translations are appended to the original text message to h...

متن کامل

ارائه یک رتبه‌بند برای خطایاب معنایی با استفاده از ویژگی‌های حساس به متن

Nowadays, a large volume of documents is generated daily. These documents generated by different persons, thus, the documents contain spelling errors. These spelling errors cause quality of the documents are decrease. Therefore, existence of automatic writing assistance tools such as spell checker/corrector can help to improve their quality. Context-sensitive are misspelled words that have been...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015